Model Selection

Multi-image understanding

# Multi-image understanding

Phi 3.5 Vision Instruct

Phi-3.5-vision is a lightweight and advanced open-source multimodal model that supports a 128K context length and focuses on processing high-quality, inference-rich text and visual data.

Transformers Other

Pixtral is a multimodal model based on the Mistral architecture that can handle image and text inputs and generate text outputs.

Pixtral-12B is a multimodal model compatible with the transformers library. It can handle image and text inputs and generate text outputs, suitable for image understanding and description tasks.

Featured Recommended AI Models

AIbase

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

© 2025AIbase